\(\rightarrow\) Definition of the project
What have we learned so far?
On financial time series, we concluded that returns and standardized residuals behave like white noise: it is difficult to predict future portfolio returns from past values alone. This finding unsettles students who are used to building portfolios based on Markowitz theory. We need to review our entire allocation strategy.
The aim of this project is to answer these questions by creating and managing five portfolios with different allocation strategies: the equally weighted benchmark, the standard tangent portfolio of Harry Markowitz, and regression-based portfolios ranging from a simple regression model to machine learning.
Welcome to our portfolio building strategy :).
The dataset consists of monthly financial information from January 31st, 1995 to February 26th, 2021 pertaining to 928 stocks around the globe. Each stock is characterized by:
The first step of the study is to process the data. This step is crucial in portfolio building with machine learning. We need to check whether the dataset contains missing points and handle them in an efficient way.
if(!require(naniar)){install.packages(c("tidyverse", "naniar", "zoo", "rpart", "rpart.plot",
"randomForest", "xgboost", "plotly",
"PerformanceAnalytics", "fpp2",
"IntroCompFinR", "e1071"))}
library(tidyverse) # Loading the package/environment
library(dplyr) # Grammar of data manipulation
library(naniar) # Missing data exploration
library(zoo) # Great package for time-series
library(rpart) # Simple decision tree package
library(rpart.plot) # Tree plot package
library(randomForest) # randomForest regression
library(xgboost) # Boosted trees regression
library(plotly) # Interactive graphs
library(PerformanceAnalytics) # To assess portfolio performance
library(fpp2) # moving average
library(IntroCompFinR) # MSR portfolio
library(e1071) # Statistic metrics kth central moment
load("data_full.RData") # Load the dataset
data_full # Show the data
data_full %>% summary # Summary of the data
## Tick Date Close Vol_1M
## Length:291392 Min. :1995-01-31 Min. : 0.00 Min. : 0.00
## Class :character 1st Qu.:2001-07-31 1st Qu.: 7.76 1st Qu.: 21.62
## Mode :character Median :2008-02-14 Median : 16.65 Median : 30.88
## Mean :2008-02-13 Mean : 64.74 Mean : 40.23
## 3rd Qu.:2014-08-29 3rd Qu.: 33.65 3rd Qu.: 46.31
## Max. :2021-02-26 Max. :180468.76 Max. :5052.37
## NA's :3416
## Mkt_Cap P2B D2E
## Min. : 0.0 Min. :0.00e+00 Min. : 0.0
## 1st Qu.: 356.6 1st Qu.:9.80e-01 1st Qu.: 16.9
## Median : 1486.0 Median :1.59e+00 Median : 51.4
## Mean : 12405.4 Mean :5.46e+00 Mean : 220.9
## 3rd Qu.: 6957.6 3rd Qu.:2.81e+00 3rd Qu.: 105.7
## Max. :2255969.1 Max. :1.24e+05 Max. :1188136.9
## NA's :5168 NA's :22664 NA's :12371
The summary table of the dataset highlights missing points in the columns Vol_1M, Mkt_Cap, P2B and D2E of the dataframe. Which variables are most affected, and in what proportion?
gg_miss_var(data_full) # Proportion of missing points per column (NA)
The two most affected columns are indeed P2B (price-to-book) and D2E (debt-to-equity). To avoid a concentration of data at some points after the scaling process, we drop the Ticks for which no information is provided (the entire column of a feature is empty). This results in a loss of 5% of the data, which is acceptable. We still have to fill the remaining dataframe for the backtesting. To do this, we use two techniques:
data_full <- data_full %>%
group_by(Tick) %>% # Perform the operation for each stock
mutate(Mkt_Cap = na.locf(Mkt_Cap, na.rm = F), # Replace the feature by previous values
Vol_1M = na.locf(Vol_1M, na.rm = F),
P2B = na.locf(P2B, na.rm = F),
D2E = na.locf(D2E, na.rm = F)) %>%
filter(!all(is.na(P2B)), !all(is.na(Mkt_Cap))) %>%
ungroup() %>% # Ungroup
group_by(Date) %>% # Perform the operation for EACH date
mutate(P2B = if_else(is.na(P2B), median(P2B, na.rm = T), P2B),
D2E = if_else(is.na(D2E), median(D2E, na.rm = T), D2E),
Mkt_Cap = if_else(is.na(Mkt_Cap), median(Mkt_Cap, na.rm = T), Mkt_Cap),
Vol_1M = if_else(is.na(Vol_1M), median(Vol_1M, na.rm = T), Vol_1M))
# Replace the feature value by the median at each date if NA
data_full %>% summary()
## Tick Date Close Vol_1M
## Length:276948 Min. :1995-01-31 Min. : 0.00 Min. : 0.00
## Class :character 1st Qu.:2001-07-31 1st Qu.: 7.90 1st Qu.: 21.84
## Mode :character Median :2008-02-14 Median : 16.87 Median : 31.25
## Mean :2008-02-13 Mean : 66.98 Mean : 41.82
## 3rd Qu.:2014-08-29 3rd Qu.: 34.17 3rd Qu.: 47.05
## Max. :2021-02-26 Max. :180468.76 Max. :5052.37
## Mkt_Cap P2B D2E
## Min. : 0.0 Min. :0.00e+00 Min. : 0.0
## 1st Qu.: 351.4 1st Qu.:9.90e-01 1st Qu.: 16.8
## Median : 1378.7 Median :1.59e+00 Median : 51.1
## Mean : 11239.3 Mean :6.74e+00 Mean : 300.0
## 3rd Qu.: 6047.9 3rd Qu.:2.84e+00 3rd Qu.: 106.4
## Max. :2255969.1 Max. :1.24e+05 Max. :1188136.9
The descriptive statistics of the features highlight a large gap between the median and both the minimum and the maximum (presence of outliers?). For instance, the maximum P2B is roughly \(7.8\times10^4\) times the median: an enormous value. The same observation holds for D2E. Are these values errors, or are they due to very small denominators (a low equity compared to debt, or a low book value compared to market value)? There are also huge discrepancies between the orders of magnitude of the features: Mkt_Cap reaches \(10^6\) while Vol_1M, D2E and P2B are around \(10^2\).
So, we have to rescale the predictors (features). We choose to work with positive values, but this is without much loss of generality. We scale the 4 variables with the 'uniformisation' method: \(\tilde{x}_i=ecdf(x)(x_i)\), where ecdf is the empirical cumulative distribution function of the sample.
normalise <- function(v){ # This is a function that 'uniformalises' a vector
v <- v %>% as.matrix()
return(ecdf(v)(v))
}
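As a quick sanity check on this 'uniformisation' (a toy vector, not from the dataset), the empirical CDF maps any sample to its ranks scaled into (0,1], regardless of the original magnitudes:

```r
v <- c(100, 2, 35, 7, 9000)  # Toy vector with very different magnitudes
u <- ecdf(v)(v)              # Proportion of sample values <= each value
u                            # 0.8 0.2 0.6 0.4 1.0, i.e. rank / n
```

This is why the method removes both outliers and scale discrepancies: only the ordering of the values survives.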
data_full <- data_full %>%
group_by(Tick) %>%
mutate(F_return = lead(Close) / Close - 1, # Compute forward returns
P_return = lag(F_return)) %>% # Compute past returns
na.omit() %>% # Take out missing data
ungroup() %>%
group_by(Date) %>%
mutate(FR_return = F_return - mean(F_return)) %>% # Relative return vs EW market
ungroup()
data_u <- data_full %>% # Data with normalized values
group_by(Date) %>% # Normalization takes place for each date
mutate_at(vars(Vol_1M:D2E), normalise) %>% # Apply normalization on features
ungroup()
data_u %>%
select(Vol_1M, Mkt_Cap, P2B, D2E) %>% # select necessary columns
pivot_longer(cols = everything(), # Convert to 'compact' mode
names_to = 'Attribute', values_to = 'Value') %>%
ggplot(aes(x = Value, fill = Attribute)) +
geom_histogram() + theme_light() + # Plot histograms
facet_grid(Attribute~.) # Stack the histograms
The distribution of the variables is highly affected by the scaling choice: we can observe a peak on D2E (due to outliers?) which will have some impact on the regression process. However, we keep this feature because its particular behaviour may explain some patterns observed in the forecasting process. For the other features, the scaling looks fine. With no missing values and scaled data, we can start our portfolio construction process.
By definition, a portfolio is a choice of weights that sum to one. To assess realistic strategies, we implement five classical portfolios, based on allocation strategies frequently encountered in hedge funds. The weighting schemes follow signals driven either by past returns or by company features. Our strategies are the following:
Equally weighted portfolio: the uniform portfolio which serves as benchmark. The EW weights are \(w=\frac{1}{N}\).
Maximum Sharpe ratio (MSR) portfolio: the weighting scheme follows Markowitz theory. It maximizes the risk premium while reducing the overall volatility, hence leading to the highest Sharpe ratio. It is rebalanced every month according to the latest financial evolution and prospects of the companies, and requires as inputs the (expected) mean returns \(\mu\) and covariance matrix \(\Sigma\) of the assets since 1995. The MSR weights are \(w=\frac{\Sigma^{-1}\mu}{1'\Sigma^{-1}\mu}\). Both \(\mu\) and \(1\) are vectors here.
Regression-based portfolio: we regress the assets' returns on past data, forecast the future returns, and invest equally in the assets whose forecasts lie above the cross-sectional 75% quantile. We take into account relative performance and not raw performance (the threshold is a cross-sectional quantile, not just zero): \(w=\frac{1}{N}\) for \(\hat{Y} \ge 75\%\text{-quantile}(\hat{Y})\).
XGBoosted trees: eXtreme Gradient Boosting (XGBoost) is an aggregation of trees. Each new tree is constructed optimally so as to minimize the training error. Just as in the regression method, we use relative performance: \(w=\frac{1}{N}\) for \(\hat{Y} \ge 75\%\text{-quantile}(\hat{Y})\).
Random forest: a compilation of decision trees with randomly chosen predictors. In our sample, we have 4 predictors (Mkt_Cap, P2B, Vol_1M, D2E). In the allocation process, we randomly combine subsets of these features and grow as many trees as possible. We use the same proportion as for the XGBoosted trees and the regression: \(w=\frac{1}{N}\) for \(\hat{Y} \ge 75\%\text{-quantile}(\hat{Y})\).
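The MSR closed form \(w=\frac{\Sigma^{-1}\mu}{1'\Sigma^{-1}\mu}\) can be sketched on made-up inputs (three hypothetical assets; the values of \(\mu\) and \(\Sigma\) below are assumptions for illustration, not estimates from the dataset):

```r
mu    <- c(0.03, 0.01, 0.02)        # Assumed expected returns
Sigma <- diag(c(0.04, 0.02, 0.03))  # Assumed (diagonal) covariance matrix
w     <- solve(Sigma, mu)           # Sigma^{-1} mu
w     <- w / sum(w)                 # Divide by 1' Sigma^{-1} mu so weights sum to one
round(w, 3)                         # 0.391 0.261 0.348
```

Note that when short sales are forbidden, as with tangency.portfolio(..., shorts = FALSE) used later, the solution is constrained and no longer follows this closed form.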
Which feature guarantees a good splitting? To answer this question, we’ll build and plot two decision trees and compare them. One tree will be built on the original dataset and the other on the normalized data.
features <- c("Mkt_Cap", "P2B", "Vol_1M", "D2E") # Predictors
sep_date <- as.Date("2015-12-31") # Train vs test separation date
train_sample <- data_full %>% # Train sample
filter(Date <= sep_date)
test_sample <- data_full %>% # Test sample
filter(Date > sep_date) %>%
na.omit()
fit <- train_sample %>% # Initial data
select(features, FR_return) %>% # Select only the predictors
rpart(FR_return ~ ., # Model: F_Return as a function of all other variables
cp = 0.00001, # Complexity: lower means more nodes/levels
maxdepth = 7, # Maximum depth (nb levels) of the tree
data = ., # Data source: from the pipe
model = TRUE)
rpart.plot(fit) # Plot
The bold text details the splitting variables and points. They are displayed just below the clusters which they divide in two. The first split is made according to the Market Capitalization (Mkt_Cap) variable at level 0.0039. Instances with Mkt_Cap larger than or equal to 0.0039 go to the cluster on the left (almost all the dataset) and those with Mkt_Cap strictly smaller than 0.0039 go to the cluster on the right (less than 1% of the dataset).
With a dataset of 881 stocks, achieving a given level of splitting requires imposing a small complexity parameter (\(10^{-5}\)) with a maximum depth of 7 layers. Nevertheless, this design still produces one large cluster containing 95% of the sample, with an average forward relative return of -0.0055. The other splits capture micro-phenomena related to the rebound of companies after a large drop in share price (e.g., Synthetic Biologics Inc (SYN) in 2006, Cytocore common stock (MDIT) in 1998), leading to average forward relative returns from 7.3 to 27, which will not be meaningful in terms of out-of-sample generalization.
pred <- predict(fit, test_sample) # Predictions
mean(pred)-mean(test_sample$FR_return) # Bias
## [1] 0.007868397
var(pred)
## [1] 0.08640867
The first number, the bias, shows the error in the average prediction, while the second, the variance, measures the dispersion of the predictions. There are only 14 observations in this set, which is a perfect illustration of the bias-variance trade-off: the bias is small (we stick to the data), but the variance is huge.
Below, we perform the same analysis on the formatted/normalized data and observe the difference. We also remove two layers from the tree, since they are not significant.
train_sample <- data_u %>% # Train sample
filter(Date <= sep_date)
test_sample <- data_u %>% # Test sample
filter(Date > sep_date) %>%
na.omit()
fit <- train_sample %>% # Normalized data
select(features, FR_return) %>% # Select only the predictors
rpart(FR_return ~ ., # Model: F_Return as a function of all other variables
cp = 0.00001, # Complexity: lower means more nodes/levels
maxdepth = 5, # Maximum depth (nb levels) of the tree
data = ., # Data source: from the pipe
model = TRUE)
rpart.plot(fit) # Plot
With 5 layers, we obtain a more efficient splitting than on the non-normalized dataset. However, the largest cluster still contains more than 95% of the dataset.
pred <- predict(fit, test_sample) # Predictions
mean(pred)-mean(test_sample$FR_return) # Bias
## [1] -9.506742e-05
var(pred)
## [1] 0.01329928
The bias-variance trade-off persists on this dataset. The magnitude of the bias is divided by 82.77 in absolute value (from 0.79% to -0.95 basis point) while the variance is divided by 6.50 (from 8.64% to 1.33%). To improve the predictions, we look at the impact of the hyperparameters on model performance and select the best ones.
\(\rightarrow\) Grid search
We start by preparing the inputs. Because we work with the simple (scaled) features, the dependent variable is the 1M (smoothed) return, since the portfolio is rebalanced monthly. Additionally, we in fact use the return relative to the average performance (the FR_return variable), to select the assets above the EW benchmark.
We recall some key hyperparameters of boosted trees:
train_features <- train_sample %>% select(features) %>% as.matrix() # Independent variables
train_label <- train_sample$FR_return # Dependent variable
train_matrix <- xgb.DMatrix(data = train_features, label = train_label) # XGB format
test_features <- test_sample %>% select(features) %>% as.matrix() # Independent variables
test_label <- test_sample$FR_return # Dependent variable
eta <- c(0.2, 0.4, 0.6, 0.8, 0.9) # eta values
nrounds <- c(10, 25, 50, 75, 90) # nb of trees
lambda <- c(0.01, 0.1, 1, 10, 100) # lambda values
pars <- expand.grid(eta, nrounds, lambda) # Exploring all combinations!
eta <- pars[,1]
nrounds <- pars[,2]
lambda <- pars[,3]
Below, we define the function used for the grid search and test the triplets of parameter values.
grid_par <- function(eta, nrounds, lambda, train_matrix, test_features, test_label){
fit <- train_matrix %>%
xgb.train(data = ., # Data source (pipe input)
eta = eta, # Learning rate
objective = "reg:squarederror", # Objective function
max_depth = 4, # Maximum depth of trees
lambda = lambda, # Penalization of leaf values
gamma = 0, # Penalization of number of leaves
nrounds = nrounds, # Number of trees used
verbose = 0 # No comment
)
pred <- predict(fit, test_features) # Predictions based on fitted model & test values
return(mean((pred-test_label)^2)) # Mean squared error
}
grd <- pmap(list(eta, nrounds, lambda), # Parameters for the grid search
grid_par, # Function on which to apply the grid search
train_matrix = train_matrix, # Input for function: training data
test_features = test_features, # Input for function: test features
test_label = test_label # Input for function: test labels (returns)
)
grd <- data.frame(eta, nrounds, lambda, error = unlist(grd)) # Dataframe with all results
grd$eta <- as.factor(eta) # Parameters as categories (for plotting)
grd %>% ggplot(aes(x = eta, y = error, fill = eta)) + # Plot!
geom_col() + theme_light() +
facet_grid(rows = vars(nrounds), cols = vars(lambda)) +
geom_hline(yintercept = var(test_label), linetype = 2)
var(test_label) # Benchmark = zero prediction
## [1] 0.0183818
If we predicted a zero return all the time, we would make a quadratic error of 1.84%. From the facet grid, a pattern emerges: lambda (from left to right) and eta (on the x-axis) seem to be the main drivers of the error. With lambda = 10-100, eta = 0.4-0.6 and nrounds = 10-25, the prediction errors are very close to the benchmark. We can learn patterns with those specific values.
\(\rightarrow\) Separating loss from performance
Since we rebalance our portfolio every month, we need a better indicator: the hit ratio. It is the percentage of times that the prediction points in the right direction, i.e., the proportion of forecasts whose sign matches that of the realized return.
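On made-up numbers (not from the dataset), the hit ratio is simply the share of prediction/realisation pairs with the same sign:

```r
pred   <- c( 0.02, -0.01,  0.03, -0.02)  # Toy forecasts
actual <- c( 0.01,  0.02, -0.01, -0.03)  # Toy realised relative returns
mean(pred * actual > 0)                  # 0.5: the signs agree twice out of four
```

This is exactly the quantity returned inside the grid-search function below.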
grid_par_hit <- function(eta, nrounds, lambda, train_matrix, test_features, test_label){
fit <- train_matrix %>%
xgb.train(data = ., # Data source (pipe input)
eta = eta, # Learning rate
objective = "reg:squarederror", # Objective function
max_depth = 4, # Maximum depth of trees
lambda = lambda, # Penalization of leaf values
gamma = 0, # Penalization of number of leaves
nrounds = nrounds, # Number of trees used
verbose = 0 # No comment
)
pred <- predict(fit, test_features) # Predictions based on fitted model & test values
return(mean(pred*test_label>0)) # Hit ratio!
}
test_label_hit <- test_sample$FR_return # Dependent variable => very different!!! Big trick!
grd <- pmap(list(eta, nrounds, lambda), # Parameters for the grid search
grid_par_hit, # Function of the grid search
train_matrix = train_matrix, # Input for function: training data
test_features = test_features, # Input for function: test features
test_label = test_label_hit # Input for function: test labels (returns)
)
grd <- data.frame(eta, nrounds, lambda, hit = unlist(grd)) # Dataframe with all results
grd$eta <- as.factor(eta) # Parameters as categories (for plotting)
# Plot!
graph <- grd %>% ggplot(aes(x = eta, y = hit, color = eta)) +
geom_point() + theme_light() +
ylim(0.45,0.53) +
geom_hline(yintercept = mean(test_label_hit>0), color = "#646464") + # Benchmark value (horiz line)
facet_grid(rows = vars(nrounds), cols = vars(lambda))
ggplotly(graph) %>% # Ease the reading
layout(legend = list(orientation = "v", x = 1.05, y = 1))
mean(test_label_hit>0) # Benchmark: assuming constant positive return (e.g., EW portfolio!)
## [1] 0.4692019
max(grd$hit) # Best hit ratio from model
## [1] 0.5251292
All hit ratios are above the benchmark of 46.92%. Indeed, many of them are above 50% (with a maximum hit at 52.51%), which is a good sign. We obtain the best results with the same parameters as mentioned above: eta = 0.6, lambda = 10, nrounds = 25. However, this performance does not take transaction costs into account. They should be computed to make sure that the performance does not vanish with trading fees.
Now that the hyperparameters of the XGBoosted trees model are fixed, we can focus on the grid search for the random forest model.
Random forests are more complex and richer than simple trees. They are compilations of decision trees with randomly chosen predictors and have less tendency to overfit, since each tree is built randomly. Thus, most of the time, their performance is higher.
\(\rightarrow\) Grid search
We recall some key hyperparameters of Random forests:
With random forests, we cannot test too many combinations of parameters since the search is time-consuming. E.g., if we selected 5 values per parameter as for the XGBoosted trees, with a computation time of around 15 s per triplet, the search would take \(\sim 0.25\times5^3 = 31.25\) minutes (more than half an hour). The nodesize is the parameter with the largest impact on computation time.
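The back-of-the-envelope cost above (assuming ~15 s per fitted triplet) can be checked directly on the size of the grid:

```r
n_triplets <- nrow(expand.grid(1:5, 1:5, 1:5))  # 5 values per parameter -> 125 triplets
n_triplets * 15 / 60                            # 31.25 minutes
nrow(expand.grid(1:4, 1:4, 1:4)) * 15 / 60      # 4 values each, as below: 16 minutes
```

This is why we settle for 4 values per parameter in the random forest grid.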
train_data <- train_sample %>% select(features, FR_return) # Dependent variable
test_features <- test_sample %>% select(features) # Independent variables
test_label <- test_sample$FR_return # Dependent variable
nodesize <- c(1, 3, 5, 10) # nodesize values
sampsize <- c(750, 1500, 3000, 5000) # sampsize of trees
ntree <- c(12, 25, 50, 100) # ntree values
pars <- expand.grid(nodesize, sampsize, ntree) # Exploring all combinations!
nodesize <- pars[,1]
sampsize <- pars[,2]
ntree <- pars[,3]
Below, we define the function used for the grid search and test the triplets of parameter values.
grid_par_rdf <- function(nodesize, sampsize, ntree, train_data, test_features, test_label){
set.seed(42) # Fixing random seed
fit <- train_data %>% # Normalized data
select(features, FR_return) %>% # Take out non predictive variables
randomForest(FR_return~., # Same formula as for simple trees!
data = ., # Data source (pipe input)
ntree = ntree, # Number of random trees
mtry = 4, # Number of predictive variables used for each tree
replace = FALSE, # All different instances
sampsize = sampsize, # Training sample size for each tree
nodesize = nodesize # Minimum size of terminal nodes
)
pred <- predict(fit, test_features) # Prediction
return(mean((pred-test_label)^2)) # Mean squared error
}
grd <- pmap(list(nodesize, sampsize, ntree), # Parameters for the grid search
grid_par_rdf, # Function on which to apply the grid search
train_data = train_data, # Input for function: training data
test_features = test_features, # Input for function: test features
test_label = test_label # Input for function: test labels (returns)
)
grd <- data.frame(nodesize, sampsize, ntree, error = unlist(grd)) # Dataframe with all results
grd$nodesize <- as.factor(nodesize) # Parameters as categories (for plotting)
grd %>% ggplot(aes(x = nodesize, y = error, fill = nodesize)) + # Plot!
geom_col() + theme_light() +
facet_grid(rows = vars(sampsize), cols = vars(ntree)) +
geom_hline(yintercept = var(test_label), linetype = 2)
var(test_label) # Benchmark = zero prediction
## [1] 0.0183818
The quadratic error of 1.84% does not change since we use the same test sample. From the facet grid, a pattern emerges: the most impactful parameters are sampsize (from top to bottom) and nodesize (on the x-axis). With sampsize = 1500-3000 and nodesize = 3 or 10, the prediction errors are very close to the benchmark. Now, let us observe whether this set has the same impact on the hit ratio.
\(\rightarrow\) Separating loss from performance
We define the equivalent hit-ratio function with the random forest setup.
grid_par_hit_rdf <- function(nodesize, sampsize, ntree, train_data, test_features, test_label){
set.seed(42) # Fixing random seed
fit <- train_data %>% # Normalized data
select(features, FR_return) %>% # Take out non predictive variables
randomForest(FR_return~., # Same formula as for simple trees!
data = ., # Data source (pipe input)
ntree = ntree, # Number of random trees
mtry = 4, # Number of predictive variables used for each tree
replace = FALSE, # All different instances
sampsize = sampsize, # Training sample size for each tree
nodesize = nodesize # Minimum size of terminal nodes
)
pred <- predict(fit, test_features) # Prediction
return(mean(pred*test_label>0)) # Hit ratio!
}
grd <- pmap(list(nodesize, sampsize, ntree), # Parameters for the grid search
grid_par_hit_rdf, # Function of the grid search
train_data = train_data, # Input for function: training data
test_features = test_features, # Input for function: test features
test_label = test_label # Input for function: test labels (returns)
)
grd <- data.frame(nodesize, sampsize, ntree, hit = unlist(grd)) # Dataframe with all results
grd$nodesize <- as.factor(nodesize) # Parameters as categories (for plotting)
# Plot!
graph <- grd %>% ggplot(aes(x = nodesize, y = hit, color = nodesize)) +
geom_point() + theme_light() +
ylim(0.45,0.53) +
geom_hline(yintercept = mean(test_label>0), color = "#646464") + # Benchmark value (horiz line)
facet_grid(rows = vars(sampsize), cols = vars(ntree))
ggplotly(graph) %>% # Ease the reading
layout(legend = list(orientation = "v", x = 1.05, y = 1))
mean(test_label>0) # Benchmark: assuming constant positive return (e.g., EW portfolio!)
## [1] 0.4692019
max(grd$hit) # Best hit ratio from models
## [1] 0.5121743
All hit ratios are above the benchmark of 46.92%, just like for the XGBoosted model. Indeed, many of them are above 50% (with a maximum hit at 51.22%), which is a good sign. However, this is still not significant in practice, since it does not take transaction costs into account. From the results of the graph, we select nodesize = 3, sampsize = 3000 and ntree = 100.
Now that the hyperparameters of the random forest model are fixed, we can properly design the weighting functions.
We now turn to the definition of the weighting schemes. We wish to compare all the strategies, tree-based and MSR, to the usual equally-weighted (EW) portfolio. We use the xgboost and randomForest algorithms to forecast future returns and form an EW portfolio of the stocks with the highest predictions: 221 companies or, equivalently, the top 25% (the sample has 882 stocks).
weights_reg <- function(past_data, curr_data, thresh){ # Weighting function based on regression
past_data <- past_data %>% select(features, F_return) # Data prior today
curr_data <- curr_data %>% select(features) # Today data
fit <- lm(F_return ~ .,
data = past_data) # Training on past data
pred <- predict(fit, curr_data) # Prediction
w <- pred > quantile(pred, thresh) # Investment decision: TRUE = 1
return(w / sum(w))
}
weights_xgb <- function(train_data, test_data, thresh){ # Weighting function based on ML process: XGBOOSTED trees
train_features <- train_data %>% select(features) %>% as.matrix() # Indep. variable
train_label <- train_data$FR_return # Dependent variable
train_matrix <- xgb.DMatrix(data = train_features, label = train_label) # XGB format
fit <- train_matrix %>%
xgb.train(data = ., # Data source (pipe input)
eta = 0.6, # Learning rate (chosen with tuning model)
objective = "reg:squarederror", # Objective function
max_depth = 4, # Maximum depth of trees (fixed)
nrounds = 25, # Number of trees used (chosen with tuning model)
lambda = 10, # Penalization of leaf values (chosen with tuning model)
gamma = 0 # Penalization of number of leaves (fixed)
)
xgb_test <- test_data %>% # Test sample => XGB format
select(features) %>%
as.matrix() %>%
xgb.DMatrix()
pred <- predict(fit, xgb_test) # Single prediction
w <- pred > quantile(pred, thresh) # Top 25% largest values, equally-weighted
return(w / sum(w))
}
weights_rdf <- function(train_data, test_data, thresh){
# Weighting function based on ML process: Random Forest
test_features <- test_data %>% # Take out non predictive variables
select(features)
set.seed(42) # Fixing random seed
fit <- train_data %>% # Normalized data
select(features, FR_return) %>% # Take out non predictive variables
randomForest(FR_return~., # Same formula as for simple trees!
data = ., # Data source (pipe input)
ntree = 100, # Number of random trees (fixed)
mtry = 4, # Number of predictive variables used for each tree (fixed: all 4 predictors)
replace = FALSE, # All different instances
sampsize = 3000, # Training sample size for each tree (chosen with tuning model)
nodesize = 3 # Minimum size of terminal nodes (chosen with tuning model)
)
pred <- predict(fit, test_features) # Prediction
w <- pred > quantile(pred, thresh) # Top 25% largest values, equally-weighted
return(w / sum(w))
}
weights_msr <- function (returns){ # Weighting function based on Markowitz theory: Maximum Sharpe ratio
ER <- returns %>%
select(Tick, Date, P_return) %>%
pivot_wider(names_from = Tick, values_from = P_return) %>%
head(-1) %>% # Select data prior the separation date to avoid forward looking
select(-Date) %>%
as.matrix() # Matrix of returns
Cov_mat <- ER %>% cov() # Covariance matrix
Pi <- tangency.portfolio(er = apply(ER, 2, mean), # Allocation function
cov.mat = Cov_mat,
risk.free = 0,
shorts = FALSE)
return(Pi$weights)
}
weights_multi <- function(past_data, curr_data, train_data, test_data, thresh, j){
# Strategies are indexed by j
if(j == 1){ # j = 1 => EW
return(rep(1/N,N))
}
if(j == 2){ # j = 2 => regression-based
return(weights_reg(past_data, curr_data, thresh))
}
if(j == 3){ # j = 3 => XGBoosted tree-based
return(weights_xgb(train_data, test_data, thresh))
}
if(j == 4){ # j = 4 => random Forest based
return(weights_rdf(train_data, test_data, thresh))
}
}
Backtesting portfolio strategies is usually simple at first, but becomes more intricate when many options are considered. We need to aggregate several arrays and index each frame to store the weights and returns in the right place. The process becomes more complicated with the MSR strategy: to compute the weights, we must invert the covariance matrix \(\Sigma\), which is only possible when the matrix is nonsingular (this requires more dates than assets). So, we first screen the stocks according to trends derived from technical analysis before allocating the weights.
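The invertibility constraint can be illustrated on random toy data (the numbers below are arbitrary): with more assets (columns) than dates (rows), the sample covariance matrix is rank-deficient, hence singular:

```r
set.seed(42)
R <- matrix(rnorm(10 * 20), nrow = 10)  # 10 dates, 20 hypothetical assets
S <- cov(R)                             # 20 x 20 sample covariance matrix
qr(S)$rank                              # At most dates - 1 = 9, so solve(S) would fail
```

Screening the stocks beforehand keeps the number of assets below the number of available dates.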
t_all <- unique(data_u$Date) # Set of dates
tick <- unique(data_u$Tick) # Set of assets
sep_date <- as.Date("2016-01-31") # This date separates in-sample vs out-of-sample
t_oos <- t_all[t_all > sep_date] # Out-of-sample dates (i.e., testing set)
Tt <- length(t_oos) - 1
nb_port <- 4 # Nb of portfolios
portf_weights <- array(0, dim = c(Tt, nb_port, length(tick))) # Store weights
portf_returns <- matrix(0, nrow = Tt, ncol = nb_port) # Store returns
Here we backtest the first 4 strategies with a loop:
strat <- c("EW", "REGRESSION", "XGBOOSTED", "RANDOMFOREST")
train_window <- 365 * 5 + 31
thresh <- 0.75 # Threshold on model
returns <- data_u %>% # Take data
select(Tick, Date, P_return) %>% # Select 3 columns
pivot_wider(names_from = Tick, values_from = P_return) # Put them into 'matrix' format
ret <- data_u %>%
group_by(Tick) %>%
mutate(Mov_avg = ma(P_return, order = 5))
for(t in 1:Tt){
realised_returns <- returns %>% # Same third step: returns
filter(Date == t_oos[t]) %>%
select(-Date)
train_data <- data_u %>%
filter(Date < t_oos[t] - 31, # One month offset because FR_Return
Date >= t_oos[t] - train_window) # Rolling window 5 years !!!
test_data <- data_u %>% filter(Date == t_oos[t]) # Current values
N <- test_data$Tick %>% n_distinct() # Numbers of assets
past_data <- data_u %>% # The source is data_u!
filter(Date < t_oos[t]) %>% # Past dates: expanding window!
select(-Date) # Take out dates
curr_data <- data_u %>% # The source is data_u!
filter(Date == t_oos[t]) %>% # Current date
select(-Date)
for(j in 1:nb_port){ # This is the novelty: we loop on the strategies
portf_weights[t,j,] <- weights_multi(past_data, curr_data,
train_data, test_data,
thresh, j)
# The weights' function is indexed by j
portf_returns[t,j] <- sum(portf_weights[t,j,] * realised_returns)
}
}
We defined a dedicated process for the MSR strategy; let us proceed with the backtest.
strat_all <- c(strat, "MSR")
portf_weights_msr <- array(0, dim = c(Tt, 1, round((N+1)/4))) # Store weights
portf_returns_msr <- matrix(0, nrow = Tt, ncol = 1) # Store returns
ret <- data_u %>%
group_by(Tick) %>%
mutate(Mov_avg = ma(P_return, order = 5)) %>% # Moving Average of order 5
group_by(Date) %>%
mutate(Mov_avg_75 = quantile(Mov_avg, 0.75, na.rm = T))
k <- returns %>% dim() # Dimension of dataframe
for(t in 1:Tt){
l <- returns %>%
filter(Date < t_oos[t]) %>%
dim() # Dimension of dataframe
temp_data <- ret %>% # data prior the out of sample date
filter(Date <= t_oos[t]) %>%
group_by(Tick) %>%
filter(Mov_avg[l[1]-2] > Mov_avg_75[l[1]-2]) %>%
ungroup() # We are looking at corporate with moving average
#greater than the 75% quantile one month ago.
realised_returns <- temp_data %>% # realized returns at each date
select(Tick, Date, P_return) %>% # Select 3 columns
pivot_wider(names_from = Tick, values_from = P_return) %>%
filter(Date == t_oos[t]) %>% # realized returns at date t
select(-Date)
portf_weights_msr[t,1,] <- weights_msr(temp_data) # Keep the same weights for the entire holding period
portf_returns_msr[t,] <- sum(portf_weights_msr[t,1,] * realised_returns)
}
With all the strategies backtested, we can now assess the performance metrics.
To compare all the strategies, we use performance metrics at an annual frequency. We compute and focus on the following:
Average return: \(\bar{\mu}\) is the simple arithmetic average of a strategy's series of returns over a given period.
Volatility: \(\sigma\) is the statistical measure of the dispersion of returns for a given strategy. In most cases, the higher the volatility, the riskier the strategy.
Sharpe ratio: the ratio of the average return earned in excess of the risk-free rate per unit of volatility (total risk). In general, the greater the Sharpe ratio, the more attractive the strategy (we are better rewarded for the additional risk taken): \(\frac{\bar{\mu} - r_f}{\sigma}\)
Value at risk: a measure of the risk of loss for investments. It estimates the loss threshold that the strategy should not breach with 95% probability, given normal market conditions, over a set time period: \(VaR_{\alpha} = \inf\{V : P(X \leq V) \geq \alpha\}\)
Maximum Drawdown: is the maximum observed loss from a peak to a trough of a portfolio, before a new peak is attained. It’s an indicator of downside risk over a specified time period. \(MDD = \frac{\text{Trough value - Peak value}}{\text{Peak value}}\)
Turnover: is a measure of how frequently assets within a fund are bought and sold. It assesses asset rotation and thus impacts transaction costs. Simple turnover computes average absolute variation in portfolio weights: \(\text{Turn}=\frac{1}{T}\sum_{t=2}^T\sum_{n=1}^N|w_t^n-w_{t-1}^n|\). Full turnover takes into account the variation of the weights between rebalancing dates: \(\text{Turn}=\frac{1}{T}\sum_{t=2}^T\sum_{n=1}^N|w_t^n-w_{t-}^n|\), where \(t-\) is the time just before the rebalancing date. We’ll use the full turnover.
Transaction cost-adjusted SR: the Sharpe ratio penalized by turnover. The higher the turnover and/or the lower the Sharpe ratio, the lower the adjusted value: \(TC\text{-}SR=(\bar{r}-0.005 \times \text{Turn})/\sigma\).
Beta: is a measure of the volatility or systematic risk of the portfolio compared to the market as a whole: \(\beta = \frac{cov(\mu,r_m)}{var(r_m)}\)
Alpha in CAPM: a measure of the active return on an investment, i.e., the performance of that investment compared with a suitable market index (EW in our case): \(\alpha = (\bar{\mu} - r_f) - \beta (\bar{r}_m - r_f)\), with \(r_f\) the risk-free rate and \(\bar{r}_m\) the average return of the benchmark.
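To make the difference between simple and full turnover concrete, here is a toy two-asset sketch (hypothetical weights and returns, not taken from our dataset). Between rebalancing dates, the weights drift with realized returns, and full turnover compares the new target weights with these drifted weights \(w_{t-}\):

```r
# Toy illustration: 2 assets, 2 consecutive rebalancing dates.
w <- rbind(c(0.5, 0.5),   # weights chosen at t = 1
           c(0.6, 0.4))   # weights chosen at t = 2
r <- c(0.10, 0.00)        # asset returns over the holding period

simple_turn <- sum(abs(w[2, ] - w[1, ]))  # compares target weights only

drifted <- w[1, ] * (1 + r)               # weights drift with returns...
drifted <- drifted / sum(drifted)         # ...then renormalize: w_{t-}
full_turn <- sum(abs(w[2, ] - drifted))   # compares against drifted weights

round(c(simple = simple_turn, full = full_turn), 4)
# simple = 0.2000, full = 0.1524
```

Here the drift already moved the portfolio partway toward the new target, so the full turnover (and hence the trading cost actually incurred) is smaller than the simple version suggests.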
perf_met <- function(portf_returns, p_weights, asset_returns, t_oos){
Avg_ret <- mean(portf_returns, na.rm = T)*12 # Arithmetic mean
Vol <- sd(portf_returns, na.rm = T)*sqrt(12) # Volatility
Sharpe_ratio <- Avg_ret / Vol # Sharpe ratio
VaR_5 <- quantile(portf_returns, 0.05)*sqrt(12) # Value-at-risk
maxDD <- maxDrawdown(portf_returns) # maximum Drawdown
Turnover <- 0 # Initialization of turnover
for(t in 2:(length(t_oos)-1)){
realized_returns <- asset_returns %>% filter(Date == t_oos[t]) %>% select(-Date)
prior_weights <- p_weights[t-1,] * (1 + realized_returns)
if(sum(prior_weights)>0){
Turnover <- Turnover + sum(abs(p_weights[t,] - prior_weights/sum(prior_weights)))
}
}
Turnover <- 12*Turnover/(length(t_oos) - 1) # Annualized average over time
Trans_cost_SR <- (Avg_ret - 0.005*Turnover) / Vol # Transaction cost-adjusted SR
met <- data.frame(Avg_ret, Vol, Sharpe_ratio, VaR_5,
maxDD, Turnover, Trans_cost_SR) # Aggregation of all of this
return(met)
}
perf_met_multi <- function(portf_returns, weights, asset_returns, t_oos, strat_name){
J <- dim(weights)[2] # Extract dimension of weights array
met <- c() # Initialize metrics dataframe
for(j in 1:J){ # Compute the performance of each strategy
temp_met <- perf_met(portf_returns[, j], weights[, j, ], asset_returns, t_oos)
met <- rbind(met, temp_met)
}
row.names(met) <- strat_name # Rename of each row
return(met)
}
returns_msr <- temp_data %>% select(Tick, Date, P_return) %>% # Returns matrix of MSR strategy
pivot_wider(names_from = Tick, values_from = P_return)
metrics <- rbind(perf_met_multi(portf_returns, portf_weights, returns, t_oos, strat),
perf_met_multi(portf_returns_msr, portf_weights_msr, returns_msr, t_oos, "MSR"))
# Apply metrics function to all strategies
metrics %>% round(4)
The equally weighted portfolio serves as a benchmark and its metrics as reference. With the lowest turnover, it incurs the lowest transaction costs. Nevertheless, its Sharpe ratio still embeds some fees, since the portfolio must be rebalanced every month back to equal weights.
Overall, the performance of the machine learning models is not impressive; it is even disappointing. The machine-learning-based strategies, which outperformed the benchmark on a narrower dataset (a core of 30 of the largest US stocks), cannot stand above the benchmark here. Their Sharpe ratios are indeed lower, since they have a lower average return and a higher volatility. Since we are trying to predict the top 25%, we face huge turnovers (10.29 and 14.72) and therefore large transaction fees. One possible explanation is that we are not looking at the right metric and may be facing a false negative. A second is that 4 features are not enough to explain the behavior of 881 companies around the world, each operating under specific economic conditions (tax reductions, inflation, GDP, unemployment rate, …).
The MSR strategy ends with the highest Sharpe ratio (1.10) and an average return lower than the benchmark's (14.88% compared to 17.98%). This result is aligned with Markowitz theory, since the allocation process penalizes volatility at the expense of return, but that is not sufficient here to make additional profit. Its turnover of 20.88 is the highest among all the portfolios.
The last strategy, based on a simple Ordinary Least Squares (OLS) regression, ends in first place. With the same features as the machine learning process, it turns out to be the best predictor. With an average return of 24.05% and, of course, a higher volatility than the benchmark (28.36% compared to 20.27%), its Sharpe ratio is more rewarding than the other strategies', with a lower turnover. However, it is the portfolio that experiences the largest loss from a peak (highest maximum drawdown).
CAPM <- function(benchmarks, returns, Rf, strat){ # Compute alpha and beta of all the strategies
J <- dim(returns)[2]
met <- c()
for (j in 1:J){
beta <- cov(returns[,j], benchmarks) # Covariance between strategy and benchmark EW
beta <- beta / var(benchmarks)
alpha <- mean(returns[,j]) - Rf - beta*(mean(benchmarks)-Rf) # Additional gain compared to the benchmark
temp_met <- data.frame(beta, alpha)
met <- rbind(met, temp_met)
}
rownames(met) <- strat
return(met)
}
Assetmodel <- rbind(CAPM(portf_returns[,1], portf_returns, 0, strat),
CAPM(portf_returns[,1], portf_returns_msr, 0, "MSR"))
Assetmodel %>% round(4)
The alpha is the additional return sought by active portfolio managers and the beta is the systematic risk embedded in each strategy. We can see that all the strategies except the MSR are riskier than the benchmark and do not provide additional return. Since those values are averages over the trading period, we need to observe the cumulative return to better appreciate the trend of the last 5 years. The MSR strategy has the lowest volatility and hence the lowest beta, \(\beta_i = \rho_{im}\frac{\sigma_i}{\sigma_m}\). At first glance, we could select it as the least risky strategy; however, we need to consider the filter applied at each date, which keeps only the top quartile according to the monthly moving average and creates huge transaction fees (sell the asset if its moving average falls below the threshold; buy/hold but adjust the weight if it stays above). Is the alpha of \(5.8‰\) large enough to cover all the fees? With a performance below the EW, we are not sure.
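The identity \(\beta_i = \rho_{im}\frac{\sigma_i}{\sigma_m}\) is just a rewriting of the covariance definition of beta used in the CAPM function above. A quick sanity check on simulated data (hypothetical returns, not from our sample) confirms that the two expressions coincide:

```r
# Simulated monthly returns: a hypothetical benchmark and a correlated strategy
set.seed(42)
r_m <- rnorm(120, mean = 0.01, sd = 0.04)        # benchmark returns
r_i <- 0.8 * r_m + rnorm(120, mean = 0, sd = 0.02) # strategy returns

beta_cov <- cov(r_i, r_m) / var(r_m)             # covariance definition
beta_rho <- cor(r_i, r_m) * sd(r_i) / sd(r_m)    # correlation x vol ratio
all.equal(beta_cov, beta_rho)                    # TRUE
```

This is why a strategy can only have a low beta by combining a low volatility with a modest correlation to the benchmark, as the MSR does here.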
The major risk of a portfolio allocation model is a false positive. To avoid it, we should carefully choose the strategy to be implemented on the trading platform. In order to correct for in-sample results that seem too optimistic, it is customary to shrink Sharpe ratios by some factor, usually close to 0.5.
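As a minimal sketch of that haircut, the shrinkage is simply a multiplicative factor applied to the in-sample Sharpe ratios (only the MSR value, 1.10, comes from the backtest above; the other values are illustrative):

```r
# Crude correction for backtest overfitting: halve in-sample Sharpe ratios.
haircut_sr <- function(sr, factor = 0.5) factor * sr

# Illustrative in-sample Sharpe ratios (MSR = 1.10 from the backtest;
# the others are made up for the example)
sr_backtest <- c(EW = 0.90, ML = 0.45, OLS = 0.95, MSR = 1.10)
haircut_sr(sr_backtest)   # deflated Sharpe ratios, e.g. MSR becomes 0.55
```

A more principled correction, accounting for how many strategies were tried, is the test of Bailey and Lopez de Prado presented next.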
Bailey and Lopez de Prado (2014) propose an approach to evaluate the statistical significance of Sharpe ratios when \(M\) strategies have been tested. The following test procedure assesses the relevance of the strategy with the highest Sharpe ratio, the MSR. \[t = \phi\left(\frac{(SR-SR^*)\sqrt{T-1}}{\sqrt{1-\gamma_3 SR+\frac{\gamma_4-1}{4}SR^2}} \right),\]
where \(\gamma_3\) and \(\gamma_4\) are the skewness and kurtosis of the returns of the strategy and \(SR^*\) is the theoretical average value of the maximum SR over \(M\) trials:
\[SR^*=\mathbb{E}[SR_m]+\sqrt{\mathbb{V}[SR_m]}\left((1-\gamma)\phi^{-1}\left(1-\frac{1}{M}\right)+\gamma \phi^{-1}\left(1-\frac{1}{Me}\right) \right),\]
where \(\phi\) is the cumulative distribution function (cdf) of the standard Gaussian law and \(\gamma\) is the Euler-Mascheroni constant. The index \(m\) runs over strategy trials and \(M\) is the total number of tested strategies. If the \(t\) value defined above is below a certain threshold (e.g., 0.95), then the \(SR\) cannot be deemed significant.
DSR <- function(SR, Tt, M, g3, g4, SR_m, SR_v){ # First, we build the function
gamma <- -digamma(1) # Euler-Mascheroni constant
SR_star <- SR_m + sqrt(SR_v)*((1-gamma)*qnorm(1-1/M) + gamma*qnorm(1-1/M/exp(1))) # Avg. Max SR
num <- (SR-SR_star) * sqrt(Tt-1) # Numerator
den <- sqrt(1 - g3*SR + (g4-1)/4*SR^2) # Denominator
return(pnorm(num/den))
}
# Next, the parameters:
M <- nrow(metrics) # Number of strategies we tested
SR <- max(metrics$Sharpe_ratio) # The SR we want to test
SR_m <- mean(metrics$Sharpe_ratio) # Average SR across all strategies
SR_v <- var(metrics$Sharpe_ratio) # Variance of SR across all strategies
g3 <- skewness(portf_returns_msr) # Function from the e1071 package
g4 <- kurtosis(portf_returns_msr) + 3 # Function from the e1071 package
DSR(SR, Tt, M, g3, g4, SR_m, SR_v) # The sought value!
## [1] 0.576204
Although this value is greater than 0.5 (\(\simeq 0.58\)), it is still far from the 0.95 confidence threshold. The SR of our best strategy (best only in terms of Sharpe ratio) is not statistically significant, because it does not lie far enough to the right of the distribution of all SRs.
plot_data <- (portf_returns+1) %>% apply(2,cumprod) # Values
plot_data <- rbind(1, plot_data) # Initiate at one
plot_data_msr <- (portf_returns_msr+1) %>% apply(2, cumprod) # Values
plot_data_msr <- rbind(1, plot_data_msr) # Initiate at one
plot_data <- cbind(plot_data, plot_data_msr) # Merge two dataframe
plot_data <- data.frame(t_oos, plot_data) # Adding the date
colnames(plot_data) <- c("Date", strat_all) # Changing column names
graph <- plot_data %>% # Final formatting & plot
pivot_longer(-Date, names_to = "strategy", values_to = "value") %>%
ggplot(aes(x = Date, y = value, color = strategy)) +
geom_line() + theme_light() + geom_point() + labs(y = "Cumulative return") +
ggtitle("Strategies returns") +
scale_colour_brewer(palette = "Set1")
ggplotly(graph) # Ggplotly to manipulate the graph
The cumulative return plot illustrates the comments above. The machine learning portfolios end with cumulative returns below 1.8, without ever surpassing the benchmark, which ends at 2.18, during the trading period. The MSR crossed the EW barrier in September 2019 but declined right after. It experienced a lower impact during the Covid crisis, which allowed it to stand above the EW and the Regression from March 2020 to November 2020 before being overtaken by those two.
Conclusion: The aim of the project was to define machine learning strategies and apply the course modules on portfolio allocation and backtesting. As we have seen, we need to choose our hyperparameters carefully to maximize the outcome of the fitting process, without learning the training sample too fast, overfitting the data and losing prediction power. The results may seem disappointing, but they are realistic! It takes experience/experimentation to make ML algorithms deliver good results. To achieve our goal, we would need to handle many more impactful features (far more than 4).
Assuming we have built a good model, one that has been tested on various out-of-sample datasets, we still need to be cautious about a recession period or a crisis. In March 2020, the world experienced the Covid-19 crisis, and some companies are still struggling to bounce back 18 months later. Regardless of the strategy implemented, as portfolio managers we need to be prepared to handle such a situation with well-prepared contingency risk management: long/short vanilla and exotic options, swaps, the bond market, …